System and Method for Managing Size of Clusters in a Computing Environment

ABSTRACT

A number of hosts in a logical cluster is adjusted up or down in an elastic manner by tracking membership of hosts in the cluster using a first data structure and tracking membership of hosts in a spare pool using a second data structure, and upon determining that a triggering condition for adding another host is met and that all hosts in the cluster are being used, selecting a host from the spare pool, and programmatically adding an identifier of the selected host to the first data structure and programmatically deleting the identifier of the selected host from the second data structure.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a division of and claims the benefit of U.S. patent application Ser. No. 15/368,381, entitled “System and Method for Managing Size of Clusters in a Computing Environment,” and filed Dec. 2, 2016, which is hereby incorporated by reference in its entirety.

BACKGROUND

A virtual machine (VM) is a software abstraction of a physical computing system capable of running one or more applications under the control of a guest operating system, where the guest operating system interacts with an emulated hardware platform, also referred to as a virtual hardware platform. One or multiple VMs and a virtual hardware platform are executed on a physical host device, such as a server-class computer. VMs are frequently employed in data centers, cloud computing platforms, and other distributed computing systems, and are executed on the physical host devices of such systems. Typically, these host devices are logically grouped or “clustered” together as a single logical construct. Thus, the aggregated computing and memory resources of the cluster that are available for running VMs can be provisioned flexibly and dynamically to the various VMs being executed.

However, there are also drawbacks to organizing host devices in clusters when executing VMs. For example, when cluster utilization is nearly full, i.e., when computing, memory, and/or networking resources of a cluster are fully utilized, VM availability can be compromised and VM latency increased significantly. While the performance of VMs in a cluster with high utilization can be improved by a system administrator manually adding host devices to the cluster and/or migrating VMs across clusters (e.g., to a less utilized cluster), such customizations are generally not scalable across the plurality of clusters included in a typical distributed computing environment and require VMs to be powered down. Further, performing such manual customizations in real time in response to dynamic workloads in a cluster is generally impracticable. Instead, manual customization of clusters is typically performed on a periodic basis, e.g., daily or weekly.

In addition, to maximize VM availability, clusters of host devices often include reserved failover capacity, i.e., host devices in the cluster that remain idle during normal operation and are therefore available for executing VMs whenever a host device in the cluster fails. Such reserved failover capacity can make up a significant portion of the resources of a cluster, but is infrequently utilized. For example, for a distributed computing system that includes 50 clusters, where each cluster includes 10 host devices and has a failover capacity of 20%, the capacity equivalent to 100 host devices is unused in the system until a failure occurs. Because failures are relatively infrequent, the majority of this reserved failover capacity is infrequently utilized, thereby incurring both capital and operational costs for little benefit.

SUMMARY

According to embodiments, a number of hosts in a logical cluster is adjusted up or down in an elastic manner. A method of adjusting the number of hosts in the cluster, according to an embodiment, includes the steps of tracking membership of hosts in the cluster using a first data structure and tracking membership of hosts in a spare pool using a second data structure, and upon determining that a triggering condition for adding another host is met and that all hosts in the cluster are being used, selecting a host from the spare pool, and programmatically adding an identifier of the selected host to the first data structure and programmatically deleting the identifier of the selected host from the second data structure.

Further embodiments provide a non-transitory computer-readable medium that includes instructions that, when executed, enable a computer to implement one or more aspects of the above method, and a system of computers including a management server that is programmed to implement one or more aspects of the above method.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates a computing environment, according to one embodiment.

FIG. 2 is a conceptual diagram that depicts an expanded view of the hardware platform of a computer host of FIG. 1, according to one or more embodiments.

FIGS. 3A-3C are conceptual diagrams illustrating the logical removal of an available host from a spare host pool and the logical addition of the available host to one of the clusters, according to one or more embodiments.

FIG. 4 sets forth a flowchart of method steps carried out by the VM management server of FIG. 1 to address low utilization in a cluster, according to one or more embodiments.

FIG. 5 sets forth a flowchart of method steps carried out by the VM management server of FIG. 1 to address high utilization in a cluster, according to one or more embodiments.

FIG. 6 sets forth a flowchart of method steps carried out by the VM management server of FIG. 1 in response to a failure of a host in a particular cluster, according to one or more embodiments.

FIG. 7 sets forth a flowchart of method steps carried out by the VM management server of FIG. 1 in response to a partial failure of a host in a particular cluster, according to one or more embodiments.

DETAILED DESCRIPTION

FIG. 1 illustrates a computing environment 100, according to one embodiment. Computing environment 100 is a virtual data center, also referred to as a software-defined data center, and includes multiple clusters of host devices, or “clusters” 120, a spare host pool 130 of available physical host devices, or “hosts,” a virtual machine (VM) management server 140, and, in some embodiments, a failed host pool 150. Computing environment 100 may include multiple virtual data centers residing within a single physical data center. Alternatively or additionally, the virtual components of computing environment 100 may be distributed across multiple physical data centers or locations.

Each cluster 120 of computing environment 100 includes a plurality of hosts 121A-121N (referred to collectively as hosts 121), each configured to execute one or more VMs 122. According to various embodiments, the number N of hosts 121 logically included in a particular cluster 120 can be varied from a predetermined minimum value, for example four, to a maximum implementable value, for example 64 or 128. More specifically, one or more hosts 121 are logically added to and/or removed from a particular cluster 120 in response to utilization of the cluster exceeding a maximum utilization threshold value, falling below a minimum utilization threshold value, and/or detection of a partial or total host failure. One embodiment of hosts 121 is described in greater detail below in conjunction with FIG. 2.

Spare host pool 130 includes a plurality of available hosts 131A-131M (referred to collectively as hosts 131). According to various embodiments, the number M of hosts 131 logically included in spare host pool 130 can be varied dynamically. Specifically, as available hosts 131 are logically removed from spare host pool 130 and logically added to any of clusters 120 as additional hosts 121, the number M of available hosts 131 in spare host pool 130 decreases. Likewise, as hosts 121 are logically removed from any of clusters 120 and logically added to spare host pool 130 as available hosts 131, the number M of available hosts 131 in spare host pool 130 increases. Each of available hosts 131 may be substantially similar in physical configuration to hosts 121, which are described in greater detail below in conjunction with FIG. 2.

In some embodiments, spare host pool 130 may include available hosts 131 that are provisioned to computing environment 100 as cloud-based resources, and consequently are located at one or more physical locations and/or data centers that are remote from the hosts 121 included in one or more of clusters 120. In other embodiments, available hosts 131 are located in the same physical location and/or data center as the hosts 121 included in one or more of clusters 120.

VM management server 140 implements a management plane of computing environment 100 and is configured to manage hosts 121 and available hosts 131 associated with computing environment 100. More specifically, VM management server 140 is configured to manage interactions between hosts 121, to determine when available hosts 131 are added to a particular cluster 120, and to determine when hosts 121 are removed from a particular cluster 120 and added to spare host pool 130. To that end, VM management server 140 includes a high availability (HA) module 141, a distributed resource scheduler (DRS) module 142, and a host provisioning module 143. HA module 141 and DRS module 142 are each configured for a computing environment in which clusters 120 are elastic, i.e., each cluster 120 can be dynamically increased and decreased in size in response to host failure and/or cluster utilization levels.

VM management server 140 typically runs on a computer host that is separate from hosts 121 and is accessible by a system administrator via a graphical user interface (GUI) or via a command line interface (CLI). In some embodiments, VM management server 140 may be configured to enable a system administrator to perform various management functions for computing environment 100 in addition to the automated functionality of HA module 141 and DRS module 142. For example, VM management server 140 may be configured to enable a system administrator to deploy VMs 122 to the different hosts 121 in a particular cluster, monitor the performance of such VMs 122, and/or manually power up and power down VMs 122.

HA module 141 is configured to ensure availability of VMs 122 executing in each cluster 120 in response to a partial or total host failure. For example, when one of hosts 121 in a particular cluster 120 experiences a software crash or hardware fault and is no longer operable, HA module 141 is configured to trigger migration module 144 to migrate VMs executing in the failed host 121 to other hosts 121 within that cluster 120. In addition, according to various embodiments, HA module 141 is configured with a failure detection module 141A to monitor clusters 120 for host faults. For example, failure detection module 141A periodically polls each host 121 in computing environment 100 for software crashes and/or hardware faults or evidence thereof. Alternatively or additionally, failure detection module 141A may be configured to receive fault warnings from hosts 121 in each cluster 120.

DRS module 142 is configured to maintain the utilization of hosts 121 in each cluster 120 between a minimum utilization threshold value and a maximum utilization threshold value. To that end, DRS module 142 is configured to determine whether utilization of hosts 121 in each cluster 120 is above the minimum utilization threshold value and less than the maximum utilization threshold value. For example, resource utilization monitor 142A in DRS module 142 polls the hosts 121 in each cluster 120 for utilization information, such as central processing unit (CPU) time, allocated memory, and the like. In addition, DRS module 142 is configured to perform conventional load balancing between hosts 121 within a particular cluster 120. For example, when resource utilization monitor 142A determines that the utilization of computing resources of a first host 121 in a particular cluster 120 is less than a minimum utilization threshold value, DRS module 142 triggers migration module 144 to perform migration of executing VMs 122 from more highly loaded hosts 121 to the first host, which is lightly loaded. In this way, DRS module 142 maintains utilization of the computing resources of hosts 121 in a particular cluster 120 within an optimal range.
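
By way of illustration, the following pseudocode sketch outlines one way the threshold comparison and load-balancing trigger described above could be expressed. The sketch is illustrative only; the threshold values, the HostStats structure, and the balance_cluster helper are assumptions of the sketch rather than elements of DRS module 142.

    # Illustrative sketch of the utilization check performed by a DRS-like
    # scheduler. Threshold values and field names are assumptions.
    from dataclasses import dataclass

    MIN_UTILIZATION = 0.30   # assumed minimum utilization threshold value
    MAX_UTILIZATION = 0.80   # assumed maximum utilization threshold value

    @dataclass
    class HostStats:
        host_id: str
        cpu: float      # fraction of CPU time in use, as reported by polling
        memory: float   # fraction of allocated memory in use

        @property
        def utilization(self) -> float:
            # A simple combined metric; a real scheduler would weigh resources.
            return max(self.cpu, self.memory)

    def balance_cluster(stats):
        """Return (source_host, target_host) migration hints that move VMs from
        the most heavily loaded hosts toward hosts below the minimum threshold."""
        overloaded = sorted((s for s in stats if s.utilization > MAX_UTILIZATION),
                            key=lambda s: -s.utilization)
        underloaded = sorted((s for s in stats if s.utilization < MIN_UTILIZATION),
                             key=lambda s: s.utilization)
        return [(src.host_id, dst.host_id) for src, dst in zip(overloaded, underloaded)]

In this sketch, pairing an overloaded host with a lightly loaded host stands in for the migration that DRS module 142 triggers via migration module 144.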

Host provisioning module 143 is configured to logically add one or more available hosts 131 to a cluster 120 or remove hosts 121 from the cluster 120 in response to a triggering event, so that utilization of computing resources of hosts 121 and availability of VMs 122 executing on hosts 121 are maintained within an optimal range. A triggering event may be detection of an average utilization of hosts 121 in a cluster 120 exceeding a maximum threshold value or falling below a minimum threshold value, or the detection of a host 121 in the cluster 120 undergoing a partial or total failure.

In one embodiment, host provisioning module 143 is configured to logically add one or more available hosts 131 to a particular cluster 120 when a host 121 in the particular cluster 120 suffers a partial or total failure. Afterwards, HA module 141 is able to migrate VMs executing in the failed host 121 to other hosts 121 within that cluster 120 and/or to the newly added available host 131. For example, when a large number of VMs 122 are executing in a particular cluster 120 and a host failure occurs, the physical computing resources per VM 122 are reduced, i.e., CPU time, allocated memory, etc., and availability for any particular VM 122 in the cluster 120 is also reduced. Thus, host provisioning module 143 is configured to rectify this issue by logically adding one or more available hosts 131 to the cluster 120 based not only on the detection of a host failure, but also on an availability of the VMs 122 executing in the cluster 120. Thus, in such embodiments, host provisioning module 143 may logically add one or more available hosts 131 to the cluster 120 experiencing the host failure when availability for one or more VMs 122 in the cluster 120 cannot be increased above the minimum threshold value via migration of VMs 122 within the cluster 120.

In some embodiments, host provisioning module 143 is configured to logically add available hosts 131 to a particular cluster 120 in which the utilization of computing resources of the hosts 121 in the particular cluster 120 is greater than a maximum utilization threshold value for the cluster 120. Thus, host provisioning module 143 increases the total number of hosts within the cluster when utilization is too high. Further, in some embodiments, host provisioning module 143 is configured to logically remove hosts 121 from a particular cluster 120 in which the utilization of hosts 121 in the particular cluster 120 is less than a minimum utilization threshold value for the cluster 120. In such embodiments, host provisioning module 143 decreases the total number of hosts within the cluster when utilization is too low.

Metrics for quantifying availability of a particular VM 122 may include, for example, effective CPU resources available for the VM 122 of interest (in MHz or GHz), effective memory resources available for the VM 122 of interest (in kB, MB, or GB), a CPU fairness value representing the fraction of host CPU resources allocated to the VM 122 of interest (in percent), and/or a memory fairness value representing the fraction of memory resources allocated to the VM 122 of interest (in percent). How and in what situations host provisioning module 143 logically adds available hosts 131 to and/or removes hosts 121 from a particular cluster 120 in response to host failures, low VM availability, and/or low or high host utilization is described below in conjunction with FIGS. 4-7.
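
As one hedged example, the availability metrics listed above could be computed roughly as follows; the equal-share assumption and the function name are illustrative and are not drawn from the embodiments.

    # Rough computation of per-VM availability metrics; all inputs are assumed.
    def vm_availability_metrics(host_cpu_mhz, host_mem_mb,
                                vm_count, vm_cpu_demand_mhz, vm_mem_demand_mb):
        # Effective resources: an equal share of the host, capped at VM demand.
        effective_cpu = min(host_cpu_mhz / vm_count, vm_cpu_demand_mhz)
        effective_mem = min(host_mem_mb / vm_count, vm_mem_demand_mb)
        return {
            "effective_cpu_mhz": effective_cpu,
            "effective_memory_mb": effective_mem,
            # Fairness values: fraction of host resources allocated to the VM.
            "cpu_fairness_pct": 100.0 * effective_cpu / host_cpu_mhz,
            "memory_fairness_pct": 100.0 * effective_mem / host_mem_mb,
        }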

As noted above, in some embodiments, computing environment 100 may include a failed host pool 150. Failed host pool 150 is a logical construct with which hosts 121 that have undergone a partial or total failure can be associated. Hosts 121 that have undergone either partial or total failure are logically associated with failed host pool 150 as failed hosts 151, so that these host devices cannot be subsequently added to a cluster 120 of computing environment 100. Thus, failed host pool 150 facilitates the identification of host devices in computing environment 100 that require maintenance, diagnostic analysis, and/or replacement.

FIG. 2 schematically illustrates a cluster 120 of computing environment 100 in FIG. 1, according to one or more embodiments. As shown, cluster 120 includes up to N hosts 121, i.e., hosts 121A, 121B, . . . , 121N, each of which is communicatively coupled to VM management server 140. In the embodiment shown in FIG. 2, cluster 120 is a software-based “virtual storage area network” (VSAN) environment that leverages the commodity local storage resources housed in or directly attached to hosts 121 in cluster 120 to provide an aggregate object store 230 to virtual machines (VMs) 122 executing on hosts 121. Hereinafter, the term “housed” or “housed in” is used to encompass both housed in and otherwise directly attached to.

Each host 121 is a computing device in which virtualization software and a plurality of VMs 122 are executing. Each host 121 is typically a server-class computer, although, in some embodiments, hosts 121 may include a variety of classes of computers, such as mainframe computers, desktop computers, and laptop computers. Each host 121 includes a virtualization layer or hypervisor 213, a storage management module (referred to herein as a “VSAN module”) 214, and a hardware platform 225 that typically includes central processing units (CPUs), random access memory (RAM), local storage resources, such as solid state drives (SSDs) 226 and/or magnetic disks 227, a host bus adapter (HBA) that enables external storage devices to be connected to the host 121, and one or more physical NICs that enable host 121 to communicate over a network. These physical computing resources of each host 121 are managed by hypervisor 213, and through hypervisor 213, a host 121 is able to launch and run multiple VMs 122.

The local commodity storage resources housed in or otherwise directly attached to hosts 121 may include combinations of SSDs 226 and/or magnetic or spinning disks 227. In certain embodiments, SSDs 226 serve as a read cache and/or write buffer in front of magnetic disks 227 to increase I/O performance. VSAN module 214 in each host 121 is configured to automate storage management workflows (e.g., creating objects in object store 230) and provide access to objects in object store 230 (e.g., handling I/O operations to objects in object store 230). For each VM 122, VSAN module 214 may then create an “object” for the specified virtual disk associated with the VM 122 by backing the virtual disk with physical storage resources of object store 230 such as SSDs 226 and/or magnetic disks 227. For example, SSD 226A and magnetic disk 227A of host 121A, SSD 226B and magnetic disk 227B of host 121B, and so on may in combination be used to back object store 230.

Each VM 122 is a software abstraction of a physical computing system that is capable of running one or more applications under the control of a guest operating system (not shown), where the guest operating system provides various operating system services (such as central processing unit (or CPU) allocation and memory management). The guest operating system interacts with a virtual hardware platform, which is an emulated hardware platform for the corresponding VM. Virtual hardware platforms (not depicted in the figure) are implemented on a particular host 121 by hypervisor 213, and typically comprise virtual CPUs, virtual RAM, virtual disks, and, for network communication, a virtual network interface controller (or NIC). The fact that applications executing within VMs 122 are executing in a virtual machine is transparent to the operation of these applications. Thus, such applications may be installed in a VM 122 unchanged from how such applications would be installed on a physical computer.

Similarly, the fact that the guest operating system installed in each VM 122 is executing on top of a virtualized hardware platform, rather than on a physical hardware platform, is transparent to the guest operating system. Thus, the guest operating system may be installed in a VM 122 in the same manner that the guest operating system is installed on a physical computer. Examples of guest operating systems include the various versions of Microsoft's Windows® operating system, the Linux operating system, and Apple's Mac OS X.

Each VM 122 executing in a particular cluster 120 accesses computing services by interacting with the virtual hardware platform associated with the particular cluster 120 and implemented by hypervisor 213. As shown, each host 121 has one hypervisor 213 executing therein. As noted above, hypervisor 213 provides a virtualization platform on which VMs 122 execute. Hypervisor 213 enables each VM 122 that executes under its control to access physical computing resources in host 121. Thus, when the guest operating system of a VM 122 schedules a process to be run on a virtual CPU of the virtual hardware platform, hypervisor 213 schedules the virtual CPU of that VM 122 to run on a physical CPU of host 121. In another example, an application executing in a VM 122 may require additional virtual RAM in order to execute a particular function. In this case, the application would issue a memory allocation request to the guest operating system, which would allocate an amount of virtual RAM to satisfy the request. In turn, hypervisor 213 would allocate a certain amount of physical RAM that corresponds to the allocated virtual RAM.

Hypervisor 213 also manages the physical computing resources of the corresponding host 121, and allocates those resources among the executing VMs 122, as well as other (i.e., non-VM) processes. Hypervisor 213 allocates physical CPU time to virtual CPUs in the VMs 122, as well as physical disk space to virtual disks for each of the VMs 122. Hypervisor 213 also enables the transmission and receipt of network traffic (through the virtual NICs) for each VM 122.

VSAN module 214 can, in some embodiments, be implemented as a VSAN device driver within hypervisor 213. In such embodiments, VSAN module 214 provides access to a conceptual VSAN 215 through which an administrator can create a number of top-level “device” or namespace objects that are backed by object store 230. In one common scenario, during creation of a device object, the administrator may specify a particular file system for the device object (referred to as “file system objects”). For example, each hypervisor 213 in each host 121 may, during a boot process, discover a /vsan/root node for a conceptual global namespace that is exposed by VSAN module 214. By, for example, accessing APIs exposed by VSAN module 214, hypervisor 213 can then determine the top-level file system objects (or other types of top-level device objects) currently residing in VSAN 215. When a VM 122 (or other client) attempts to access one of the file system objects in VSAN 215, hypervisor 213 may dynamically “auto-mount” the file system object at that time. Each VSAN module 214 communicates with other VSAN modules 214 of other hosts 121 in cluster 120 to create and maintain an in-memory metadata database that contains metadata describing the locations, configurations, policies, and relationships among the various objects stored in object store 230. Thus, in each host 121, such an in-memory metadata database is maintained separately but in synchronized fashion in the memory of each host 121.
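
A minimal sketch of the kind of record such an in-memory metadata database might hold is shown below; the field names are assumptions for illustration and do not reflect the actual VSAN schema.

    # Hypothetical shape of one entry in the in-memory metadata database that
    # each VSAN module keeps in synchronized fashion with its peers.
    from dataclasses import dataclass, field

    @dataclass
    class ObjectMetadata:
        object_id: str
        object_type: str                 # e.g., "namespace" or "virtual_disk"
        component_hosts: list            # host IDs whose local storage backs the object
        policy: dict = field(default_factory=dict)   # e.g., {"replicas": 2}

    # One copy of the database per host, kept in memory and synchronized
    # through communication among the VSAN modules of the cluster.
    metadata_db = {}   # maps object_id -> ObjectMetadata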

In the embodiment illustrated in FIG. 2, hosts 121 in cluster 120 are connected to the same shared storage formed from aggregated local storage resources, i.e., object store 230, via a VSAN, i.e., VSAN 215. Alternatively, cluster 120 may be configured to connect hosts 121 to shared storage using other approaches. For example, in one embodiment, the shared storage may be provided through a storage area network (SAN), and access to the SAN is provided by the HBA in each host 121. In another embodiment, the shared storage may be provided by network attached storage (NAS) which is accessed through the NIC in each host 121.

FIGS. 3A-3C are conceptual diagrams illustrating the logical removal of an available host 131 from spare host pool 130 and the logical addition of the available host 131 to one of the clusters 120, according to one or more embodiments. In one embodiment, host provisioning module 143 maintains a table for each cluster to track which hosts belong to which cluster. When a host is logically added to a cluster, the host identifier (ID) is added to the table for that cluster, and when a host is logically removed from a cluster, the host ID is deleted from the table.
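
The per-cluster tables described above can be pictured with a sketch such as the following; the data structure names and host IDs are purely illustrative assumptions.

    # Illustrative membership tables: one table (here, a set of host IDs) per
    # cluster, plus tables for the spare host pool and the failed host pool.
    cluster_members = {"cluster-01": {"host-a", "host-b", "host-c"}}
    spare_pool = {"host-d", "host-e"}
    failed_pool = set()

    def add_host_to_cluster(cluster_id, host_id):
        spare_pool.discard(host_id)                # delete the ID from the spare pool table
        cluster_members[cluster_id].add(host_id)   # add the ID to the cluster's table

    def remove_host_from_cluster(cluster_id, host_id, failed=False):
        cluster_members[cluster_id].discard(host_id)
        (failed_pool if failed else spare_pool).add(host_id)

Applied to the example of FIGS. 3A-3C discussed below, removing the failed host with failed=True and then adding a host from the spare pool leaves the cluster with the same number of hosts while the spare pool shrinks by one and the failed host pool grows by one.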

FIG. 3A illustrates spare host pool 130, failed host pool 150, and one of clusters 120 of computing environment 100 prior to a triggering event that results in a change in the number of hosts 121 logically included in cluster 120. As shown, spare host pool 130 includes a plurality of available hosts 131, for example, five, cluster 120 includes multiple hosts 121, for example three, and failed host pool 150 includes one failed host 151.

FIG. 3B illustrates spare host pool 130, failed host pool 150, and one of clusters 120 of computing environment 100 after a triggering event is detected. In some embodiments, such a triggering event may be the detection of an average utilization of hosts 121 in cluster 120 exceeding a maximum threshold value or an availability of a particular VM 122 (or group of VMs 122) falling below a minimum allowable threshold value. In another embodiment, such a triggering event may be the detection of a host 121 undergoing a partial or total failure. Partial failures of a host 121 that may be considered a triggering event include the failure of a component of the host that allows continued operation of the host 121, such as a partial memory failure, a fan failure, the failure of a single disk of a multi-disk drive, the failure of a single magnetic disk drive in a host 121 that includes multiple magnetic disk drives, etc. Total failures of a host 121 that may be considered a triggering event include a software-related host freeze, complete failure of the only magnetic disk drive in the host, and any other faults or failures that prevent the continued operation of the host 121.

In response to the above-described triggering event, host provisioning module 143 adds an available host 131 to cluster 120 as a new host 321, as shown in FIG. 3B. For example, in some situations, when a failed host 322 is detected, HA module 141 may not be able to maintain an availability of a particular VM 122 or group of VMs 122 if the VMs were to be migrated from failed host 322 to different hosts 121 within cluster 120. In such a scenario, host provisioning module 143 increases the number of hosts 121 in the cluster 120 by logically adding an available host 131 from spare host pool 130, as shown. In addition, host provisioning module 143 logically removes failed host 322 from cluster 120 and logically adds failed host 322 to failed host pool 150, as shown. In another example, when DRS module 142 cannot maintain a utilization of all hosts 121 in a particular cluster 120 below a maximum allowable utilization by performing load balancing of hosts 121 within the cluster 120, host provisioning module 143 increases the number of hosts 121 in the cluster 120 by adding an available host 131 from spare host pool 130.

FIG. 3C illustrates spare host pool 130, failed host pool 150, and one of clusters 120 of computing environment 100 after failed host 322 in cluster 120 has been replaced, for example in response to a failure of a host 121 in cluster 120. As shown, spare host pool 130 includes one fewer available host 131, for example, four, cluster 120 includes the same number of hosts 121, for example three, and failed host pool 150 includes an additional failed host 151, for example two.

FIG. 4 sets forth a flowchart of method steps carried out by VM management server 140 to address low utilization in a cluster, according to one or more embodiments. Although the method steps in FIG. 4 are described in conjunction with computing environment 100 of FIGS. 1-3, persons skilled in the art will understand that the method in FIG. 4 may also be performed with other types of computing systems, for example, any distributed computing system that includes a cluster of host devices executing VMs.

As shown, a method 400 begins at step 401, in which resource utilization monitor 142A of VM management server 140 determines that a utilization in a particular cluster 120 is less than a minimum threshold value. For example, in some embodiments, the utilization may be a utilization of a specific host 121 in the particular cluster 120. Alternatively, the utilization may be a utilization associated with a group of hosts 121 in the particular cluster 120, or of all hosts 121 in the particular cluster 120, such as an average utilization thereof. In step 401, the utilization is typically measured or quantified via performance monitoring functions included in VM management server 140, and may be quantified in terms of computing resources in use by a host or hosts in the particular cluster, such as percentage utilization of CPU, RAM, and the like.

It is noted that because a utilization associated with the particular cluster 120 is less than the minimum threshold value, more computing resources are employed in the particular cluster 120 than are required to efficiently execute the VMs 122 currently running on the particular cluster 120. Consequently, VM management server 140 reduces the current number of hosts 121 that are logically included in the particular cluster 120 via the subsequent steps of method 400.

In step 402, in response to the determination of step 401, host provisioning module 143 of VM management server 140 selects a host 121 in the particular cluster 120 to be logically removed therefrom. In some embodiments, the selected host is the host 121 in the particular cluster 120 with the highest utilization, thereby maximizing the impact on utilization in the particular cluster 120. In some embodiments, once the host is selected in step 402, additional write I/O's from VMs executing on hosts 121 to the local storage resources included in the selected host are not allowed, so that data stored locally on the selected host can be moved to the remaining hosts 121. By contrast, in such embodiments, read I/O's are still permitted to the selected host, so that VMs 122 executing on hosts 121 can access object store 230 as needed.

In step 403, DRS module 142 triggers migration of VMs (performed by migration module 144) from the selected host to other hosts 121 in the particular cluster. Techniques for load-balancing between hosts 121 that are well-known in the art may be employed to complete migration of VMs from the selected host in step 403.

In step 404, DRS module 142 copies data that are stored, as part of object store 230, in local storage resources housed in the selected host. The data are copied to other local storage resources housed in one or more other hosts 121 in the particular cluster 120. Upon completion of step 404, VMs executing in the particular cluster 120 no longer access the local storage resources housed in the selected host, since all file system objects associated with VSAN 215 are stored elsewhere within the particular cluster 120.

In step 405, host provisioning module 143 logically removes the selected host from the particular cluster 120. In step 406, host provisioning module 143 logically adds the selected host to spare host pool 130 as an additional available host 131.
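
Steps 401 through 406 can be summarized in the following condensed sketch. The helper functions are stubs standing in for the monitoring, migration, and storage facilities of VM management server 140; only the control flow is drawn from FIG. 4, and the threshold value is an assumption.

    # Condensed sketch of method 400 (scale-in on low utilization); helpers are stubs.
    MIN_UTILIZATION = 0.30                           # assumed minimum threshold value

    cluster_hosts = {"cluster-01": {"host-a", "host-b", "host-c"}}
    spare_pool = {"host-d"}

    def average_utilization(cluster_id):             # stub: polling by resource utilization monitor 142A
        return 0.20

    def pick_host_to_remove(cluster_id):             # stub: e.g., the highest-utilization host
        return next(iter(cluster_hosts[cluster_id]))

    def block_new_writes(host):                      # stub: disallow new write I/Os; reads still allowed
        pass

    def migrate_vms_off(host, cluster_id):           # stub: step 403, load-balanced migration
        pass

    def copy_local_object_data(host, cluster_id):    # stub: step 404, preserve object store 230 data
        pass

    def handle_low_utilization(cluster_id):
        if average_utilization(cluster_id) >= MIN_UTILIZATION:
            return                                   # step 401: utilization is acceptable
        host = pick_host_to_remove(cluster_id)       # step 402
        block_new_writes(host)
        migrate_vms_off(host, cluster_id)            # step 403
        copy_local_object_data(host, cluster_id)     # step 404
        cluster_hosts[cluster_id].discard(host)      # step 405: logically remove from the cluster
        spare_pool.add(host)                         # step 406: return to spare host pool 130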

FIG. 5 sets forth a flowchart of method steps carried out by VM management server 140 to address high utilization in a cluster, according to one or more embodiments. Although the method steps in FIG. 5 are described in conjunction with computing environment 100 of FIGS. 1-3, persons skilled in the art will understand that the method in FIG. 5 may also be performed with other types of computing systems, for example, any distributed computing system that includes a cluster of host devices executing VMs.

As shown, method 500 begins at step 501, in which resource utilization monitor 142A determines that a utilization in a particular cluster 120 is higher than a maximum threshold value, where the utilization is substantially similar to that described above in step 401 of method 400, and is measured as described. It is noted that because a utilization associated with the particular cluster 120 is greater than the maximum threshold value, insufficient computing resources are employed in the particular cluster 120 to provide failover capacity and/or to efficiently execute the VMs 122 currently running on the particular cluster 120. Consequently, VM management server 140 increases the number of hosts 121 that are currently logically included in the particular cluster 120 via the subsequent steps of method 500.

In step 502, in response to the determination of step 501, host provisioning module 143 selects a host from spare host pool 130.

In step 503, host provisioning module 143 prepares the host selected in step 502 for use in the particular cluster 120. In some embodiments, step 503 includes imaging a hypervisor 213 on the selected host and configuring networking connections between the selected host and components of cluster 120, for example, via the HBA and/or NICs of the selected available host 131. In such embodiments, the selected host is provided permissions for accessing object store 230 and the various local storage resources of the particular cluster 120, such as SSDs 226 and/or magnetic disks 227.

In step 504, host provisioning module 143 logically adds the selected host to the particular cluster 120. For example, in some embodiments, a cluster membership data structure associated with the particular cluster 120 is modified with a unique host identifier for the selected host being added. In addition, VSAN 215 is notified of the new physical storage addresses associated with the newly added host, since these new physical storage addresses are used to back a portion of object store 230. Thus, VSAN 215 is informed of the addition of the selected host.
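
A similarly condensed sketch of steps 501 through 504 follows; as before, the helper functions and threshold value are assumptions, and only the control flow mirrors FIG. 5.

    # Condensed sketch of method 500 (scale-out on high utilization); helpers are stubs.
    MAX_UTILIZATION = 0.80                           # assumed maximum threshold value

    cluster_hosts = {"cluster-01": {"host-a", "host-b", "host-c"}}
    spare_pool = {"host-d", "host-e"}

    def average_utilization(cluster_id):             # stub: polling by resource utilization monitor 142A
        return 0.90

    def prepare_host(host, cluster_id):              # stub: step 503 - image hypervisor 213, configure
        pass                                         # networking, grant access to object store 230

    def notify_vsan(host):                           # stub: expose the host's local storage to VSAN 215
        pass

    def handle_high_utilization(cluster_id):
        if average_utilization(cluster_id) <= MAX_UTILIZATION:
            return                                   # step 501: utilization is acceptable
        host = spare_pool.pop()                      # step 502: select a host from spare host pool 130
        prepare_host(host, cluster_id)               # step 503
        cluster_hosts[cluster_id].add(host)          # step 504: add ID to the membership data structure
        notify_vsan(host)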

FIG. 6 sets forth a flowchart of method steps carried out by VM management server 140 in response to a failure of a host in a particular cluster 120, according to one or more embodiments. Although the method steps in FIG. 6 are described in conjunction with computing environment 100 of FIGS. 1-3, persons skilled in the art will understand that the method in FIG. 6 may also be performed with other types of computing systems, for example, any distributed computing system that includes a cluster of host devices executing VMs.

As shown, method 600 begins at step 601, in which failure detection module 141A determines that a host 121 included in a particular cluster 120 has experienced a failure. For example, host 121 may experience a software crash or hardware fault that prevents continued operation of host 121.

In optional step 602, in response to determining that host 121 in the particular cluster 120 has experienced the crash or failure, VM management server 140 determines whether the particular cluster 120 currently includes sufficient operable hosts 121 for proper operation of cluster 120. If yes, method 600 proceeds to step 610 and terminates; if no, method 600 proceeds to step 603. Alternatively, optional step 602 is skipped and method 600 proceeds directly from step 601 to step 603.

In some embodiments, in step 602 host provisioning module 143 determines whether the particular cluster 120 currently includes sufficient operable hosts 121 based on whether the detected failure of the host 121 results in the total number of operable hosts 121 in the particular cluster 120 being less than a predetermined minimum threshold number of hosts 121. For example, to provide sufficient failover capacity and/or redundancy in the particular cluster 120, a minimum of four operable hosts 121 may be in effect. Alternatively or additionally, in some embodiments, in step 602 VM management server 140 determines whether the particular cluster 120 currently includes sufficient operable hosts 121 based on the availability of the VMs 122 currently executing on the hosts 121 of the particular cluster 120. That is, VM management server 140 may determine whether the availability of the VMs 122 currently executing on the hosts 121 of the particular cluster 120 is greater than a minimum requirement or target. In such embodiments, VM management server 140 may determine availability based on whether there are sufficient computing resources for these VMs to execute with an acceptable latency, such as CPU processing time, available memory, and the like.
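
The two checks described for step 602 might be combined as in the sketch below; apart from the example minimum of four hosts, the numeric target and the availability helper are assumptions.

    # Sketch of the sufficiency check of optional step 602.
    MIN_OPERABLE_HOSTS = 4           # example minimum from the description
    MIN_VM_AVAILABILITY = 0.95       # assumed availability target

    def vm_availability(cluster_id):                 # stub: per-VM availability metric (see above)
        return 0.97

    def cluster_has_sufficient_hosts(operable_host_count, cluster_id):
        if operable_host_count < MIN_OPERABLE_HOSTS:
            return False                             # not enough failover capacity or redundancy
        return vm_availability(cluster_id) >= MIN_VM_AVAILABILITY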

In step 603, in response to the determination of step 601, host provisioning module 143 selects a host from spare host pool 130.

In step 604, host provisioning module 143 prepares the host selected in step 603 for use in the particular cluster 120. Generally, step 604 in method 600 may be substantially similar to step 503 in method 500.

In step 605, host provisioning module 143 logically adds the selected host 131 to the particular cluster 120. Generally, step 605 in method 600 may be substantially similar to step 504 in method 500.

Thus, implementation of method 600 enables the number of operable hosts 121 that are currently logically included in a particular cluster 120 to be increased when insufficient computing resources are employed in the particular cluster 120 as a result of a host failure.

FIG. 7 sets forth a flowchart of method steps carried out by VM management server 140 in response to a partial failure of a host in a particular cluster 120, according to one or more embodiments. Although the method steps in FIG. 7 are described in conjunction with computing environment 100 of FIGS. 1-3, persons skilled in the art will understand that the method in FIG. 7 may also be performed with other types of computing systems, for example, any distributed computing system that includes a cluster of host devices executing VMs.

As shown, method 700 begins at step 701, in which failure detection module 141A determines that a host 121 included in a particular cluster 120 has experienced a partial failure and is compromised. For example, the host 121 may experience the failure of a component that allows continued operation of the host 121, such as a partial memory failure, a fan failure, the failure of a single disk of a multi-disk drive, the failure of a single magnetic disk drive in a host 121 that includes multiple magnetic disk drives, and the like. Such a host 121 is referred to hereinafter as a “compromised host.” According to some embodiments, because operation of the compromised host is at least partially compromised, the compromised host is subsequently replaced in the particular cluster 120 with an available host 131 via the subsequent steps of method 700.

In optional step 702, in response to determining that a host 121 in the particular cluster 120 has experienced the partial failure, host provisioning module 143 determines whether the particular cluster 120 currently includes sufficient operable hosts 121 for proper operation of cluster 120. If yes, method 700 proceeds directly to step 706; if no, method 700 proceeds to step 703. Alternatively, optional step 702 is skipped and method 700 proceeds directly from step 701 to step 703.

In step 703, host provisioning module 143 selects a host from spare host pool 130.

In step 704, host provisioning module 143 prepares the host selected in step 703 for use in the particular cluster 120. Generally, step 704 in method 700 may be substantially similar to step 503 in method 500.

In step 705, host provisioning module 143 logically adds the selected host 131 to the particular cluster 120. Generally, step 705 in method 700 may be substantially similar to step 504 in method 500.

In step 706, host provisioning module 143 copies data that are stored, as part of object store 230, in local storage resources housed in the compromised host. The data are copied to other local storage resources housed in one or more other hosts 121 in the particular cluster 120, and/or to the newly added available host 131. Upon completion of step 706, VMs executing in the particular cluster 120 no longer access the local storage resources housed in the compromised host, since all file system objects associated with VSAN 215 are stored elsewhere within the particular cluster 120.

In step 707, HA module 141 triggers migration of VMs (performed by migration module 144) from the compromised host to other hosts 121 in the particular cluster 120. In embodiments in which an available host 131 is selected in step 703, some or all VMs executing on the compromised host are migrated to the newly added available host 131. Alternatively, the VMs executing on the compromised host are instead distributed among the other hosts 121 in the particular cluster 120. Techniques for load-balancing between hosts 121 that are well-known in the art may be employed to complete migration of VMs from the compromised host in step 707.
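
Steps 701 through 707 can likewise be condensed into a sketch; the stubs below are assumptions, and the final association of the compromised host with failed host pool 150 follows the handling described for FIG. 3B.

    # Condensed sketch of method 700 (replacing a compromised host); helpers are stubs.
    cluster_hosts = {"cluster-01": {"host-a", "host-b", "host-c"}}
    spare_pool = {"host-d"}
    failed_pool = set()

    def cluster_has_sufficient_hosts(cluster_id):        # stub: optional step 702 (see above)
        return False

    def prepare_host(host, cluster_id):                  # stub: step 704
        pass

    def copy_local_object_data(host, cluster_id):        # stub: step 706
        pass

    def migrate_vms_off(host, cluster_id):               # stub: step 707
        pass

    def handle_partial_failure(cluster_id, compromised_host):
        if not cluster_has_sufficient_hosts(cluster_id):      # step 702
            new_host = spare_pool.pop()                       # step 703
            prepare_host(new_host, cluster_id)                # step 704
            cluster_hosts[cluster_id].add(new_host)           # step 705
        copy_local_object_data(compromised_host, cluster_id)  # step 706
        migrate_vms_off(compromised_host, cluster_id)         # step 707
        cluster_hosts[cluster_id].discard(compromised_host)   # compromised host is then associated
        failed_pool.add(compromised_host)                     # with failed host pool 150 (FIG. 3B)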

Thus, implementation of method 700 enables the number of operable hosts 121 that are currently logically included in a particular cluster 120 to be increased when insufficient computing resources are employed in the particular cluster 120 as a result of a partial host failure.

Certain embodiments as described above involve a hardware abstraction layer on top of a host computer. The hardware abstraction layer allows multiple contexts or virtual computing instances to share the hardware resource. In one embodiment, these virtual computing instances are isolated from each other, each having at least a user application running therein. The hardware abstraction layer thus provides benefits of resource isolation and allocation among the virtual computing instances. In the foregoing embodiments, virtual machines are used as an example for the virtual computing instances and hypervisors as an example for the hardware abstraction layer. As described above, each virtual machine includes a guest operating system in which at least one application runs. It should be noted that these embodiments may also apply to other examples of virtual computing instances, such as containers not including a guest operating system, referred to herein as “OS-less containers” (see, e.g., docker.com). OS-less containers implement operating system-level virtualization, wherein an abstraction layer is provided on top of the kernel of an operating system on a host computer. The abstraction layer supports multiple OS-less containers each including an application and its dependencies. Each OS-less container runs as an isolated process in user space on the host operating system and shares the kernel with other containers. The OS-less container relies on the kernel's functionality to make use of resource isolation (CPU, memory, block I/O, network, etc.) and separate namespaces and to completely isolate the application's view of the operating environments. By using OS-less containers, resources can be isolated, services restricted, and processes provisioned to have a private view of the operating system with their own process ID space, file system structure, and network interfaces. Multiple containers can share the same kernel, but each container can be constrained to only use a defined amount of resources such as CPU, memory and I/O.

The various embodiments described herein may employ various computer-implemented operations involving data stored in computer systems. For example, these operations may require physical manipulation of physical quantities. Usually, though not necessarily, these quantities may take the form of electrical or magnetic signals, where they or representations of them are capable of being stored, transferred, combined, compared, or otherwise manipulated. Further, such manipulations are often referred to in terms such as producing, identifying, determining, or comparing. Any operations described herein that form part of one or more embodiments of the invention may be useful machine operations. In addition, one or more embodiments of the invention also relate to a device or an apparatus for performing these operations. The apparatus may be specially constructed for specific required purposes, or it may be a general purpose computer selectively activated or configured by a computer program stored in the computer. In particular, various general purpose machines may be used with computer programs written in accordance with the teachings herein, or it may be more convenient to construct a more specialized apparatus to perform the required operations.

The various embodiments described herein may be practiced with other computer system configurations including hand-held devices, microprocessor systems, microprocessor-based or programmable consumer electronics, minicomputers, mainframe computers, and the like.

One or more embodiments of the present invention may be implemented as one or more computer programs or as one or more computer program modules embodied in one or more computer readable media. The term computer readable medium refers to any data storage device that can store data which can thereafter be input to a computer system; computer readable media may be based on any existing or subsequently developed technology for embodying computer programs in a manner that enables them to be read by a computer. Examples of a computer readable medium include a hard drive, network attached storage (NAS), read-only memory, random-access memory (e.g., a flash memory device), a CD (Compact Disc), a CD-ROM, a CD-R, or a CD-RW, a DVD (Digital Versatile Disc), a magnetic tape, and other optical and non-optical data storage devices. The computer readable medium can also be distributed over a network-coupled computer system so that the computer readable code is stored and executed in a distributed fashion.

Although one or more embodiments of the present invention have been described in some detail for clarity of understanding, it will be apparent that certain changes and modifications may be made within the scope of the claims. Accordingly, the described embodiments are to be considered as illustrative and not restrictive, and the scope of the claims is not to be limited to details given herein, but may be modified within the scope and equivalents of the claims. In the claims, elements and/or steps do not imply any particular order of operation, unless explicitly stated in the claims.

Virtualization systems in accordance with the various embodiments may be implemented as hosted embodiments, as non-hosted embodiments, or as embodiments that tend to blur distinctions between the two; all are envisioned. Furthermore, various virtualization operations may be wholly or partially implemented in hardware. For example, a hardware implementation may employ a look-up table for modification of storage access requests to secure non-disk data.

Many variations, modifications, additions, and improvements are possible, regardless of the degree of virtualization. The virtualization software can therefore include components of a host, console, or guest operating system that perform virtualization functions. Plural instances may be provided for components, operations or structures described herein as a single instance. Finally, boundaries between various components, operations and data stores are somewhat arbitrary, and particular operations are illustrated in the context of specific illustrative configurations. Other allocations of functionality are envisioned and may fall within the scope of the invention(s). In general, structures and functionality presented as separate components in exemplary configurations may be implemented as a combined structure or component. Similarly, structures and functionality presented as a single component may be implemented as separate components. These and other variations, modifications, additions, and improvements may fall within the scope of the appended claim(s).

We claim:
1. In a data center comprising a cluster of hosts, a spare host pool, and a failed host pool, a method of adjusting the number of hosts in the cluster, comprising: tracking membership of hosts in the cluster using a first data structure; tracking membership of hosts in the spare host pool using a second data structure; tracking membership of hosts in the failed host pool using a third data structure; determining that the cluster does not include a sufficient number of operable hosts based on determining that an availability of one or more virtual machines (VMs) on a first host in the cluster is less than a threshold; selecting a second host from the spare host pool; adding an identifier of the selected second host to the first data structure; migrating the VM from the first host to the second host; adding an identifier of the first host to the third data structure; and deleting the identifier of the first host from the first data structure.
2. The method of claim 1, wherein determining that the cluster does not include a sufficient number of operable hosts is based on determining the existence of a partial or total failure of the first host in the cluster.
3. The method of claim 1, wherein the VM is migrated from the first host to the second host upon a failure of the first host.
4. The method of claim 1, further comprising: copying data locally stored in the first host to a storage device accessible by the cluster.
5. The method of claim 1, wherein the VM is migrated from the first host to the second host when an average resource utilization in the first host is greater than an upper threshold utilization.
6. The method of claim 5, wherein the resource is CPU or memory.
7. The method of claim 6, further comprising: copying data locally stored in a host with the lowest resource utilization to a storage device accessible by the cluster.
8. A non-transitory computer-readable medium comprising instructions that are executable in a computing device to cause the computing device to at least: track membership of hosts in the cluster using a first data structure; track membership of hosts in the spare host pool using a second data structure; track membership of hosts in the failed host pool using a third data structure; determine that the cluster does not include a sufficient number of operable hosts based on determining that an availability of one or more virtual machines (VMs) on a first host in the cluster is less than a threshold; select a second host from the spare host pool; add an identifier of the selected second host to the first data structure; migrate the VM from the first host to the second host; add an identifier of the first host to the third data structure; and delete the identifier of the first host from the first data structure.
9. The non-transitory computer-readable medium of claim 8, wherein determining that the cluster does not include a sufficient number of operable hosts is based on determining the existence of a partial or total failure of the first host in the cluster.
10. The non-transitory computer-readable medium of claim 8, wherein the VM is migrated from the first host to the second host upon a failure of the first host.
11. The non-transitory computer-readable medium of claim 8, wherein the instructions further cause the computing device to copy data locally stored in the first host to a storage device accessible by the cluster.
12. The non-transitory computer-readable medium of claim 8, wherein the VM is migrated from the first host to the second host when an average resource utilization in the first host is greater than an upper threshold utilization.
13. The non-transitory computer-readable medium of claim 12, wherein the resource is CPU or memory.
14. The non-transitory computer-readable medium of claim 13, wherein the instructions cause the computing device to copy data locally stored in a host with the lowest resource utilization to a storage device accessible by the cluster.
15. A system of computers, comprising: a cluster of hosts; a spare host pool; a failed host pool; and a management server configured to: track membership of hosts in the cluster using a first data structure; track membership of hosts in the spare host pool using a second data structure; track membership of hosts in the failed host pool using a third data structure; determine that the cluster does not include a sufficient number of operable hosts based on determining that an availability of one or more virtual machines (VMs) on a first host in the cluster is less than a threshold; select a second host from the spare host pool; add an identifier of the selected second host to the first data structure; migrate the VM from the first host to the second host; add an identifier of the first host to the third data structure; and delete the identifier of the first host from the first data structure.
16. The system of claim 15, wherein determining that the cluster does not include a sufficient number of operable hosts is based on determining the existence of a partial or total failure of the first host in the cluster.
17. The system of claim 15, wherein the VM is migrated from the first host to the second host upon a failure of the first host.
18. The system of claim 15, wherein the management server is further configured to copy data locally stored in the first host to a storage device accessible by the cluster.
19. The system of claim 15, wherein the VM is migrated from the first host to the second host when an average resource utilization in the first host is greater than an upper threshold utilization.
20. The system of claim 19, wherein the resource is CPU or memory.